Extracting MWEs from Italian corpora: A case study for refining the POS-pattern methodology
Abstract
An established method for MWE extraction is the combined use of previously identified POS-patterns and association measures. However, the selection of such POS-patterns is rarely debated. Focusing on Italian MWEs containing at least one adjective, we set out to explore how candidate POS-patterns listed in relevant literature and lexicographic sources compare with POS sequences exhibited by statistically significant n-grams including an adjective position extracted from a large corpus of Italian. All literature-derived patterns are found among the top-ranking trigrams for three association measures, and new meaningful candidate patterns emerge there as well. We conclude that a final solid set to be used for MWE extraction will have to be further refined through a combination of association measures as well as manual inspection.
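The comparison described in the abstract can be illustrated with a minimal sketch: score trigrams with an association measure, keep those that contain an adjective, and collect the POS sequences of the top-ranked ones. Everything specific here is an assumption made for illustration: the toy tagged corpus, the function names, the Universal Dependencies style tags, and the choice of three-way pointwise mutual information (the abstract does not name its three measures).

```python
import math
from collections import Counter

# Hypothetical toy corpus of (token, POS tag) pairs; a real study would use
# a large tagged corpus of Italian.
TAGGED = [
    ("la", "DET"), ("carta", "NOUN"), ("verde", "ADJ"), ("e", "CCONJ"),
    ("la", "DET"), ("carta", "NOUN"), ("bianca", "ADJ"),
]


def trigram_pmi(tagged):
    """Score each trigram with three-way PMI: log2(p(xyz) / (p(x) p(y) p(z)))."""
    n = len(tagged)
    unigrams = Counter(tagged)
    trigrams = Counter(zip(tagged, tagged[1:], tagged[2:]))
    total = sum(trigrams.values())
    scores = {}
    for tri, freq in trigrams.items():
        p_joint = freq / total
        p_indep = math.prod(unigrams[item] / n for item in tri)
        scores[tri] = math.log2(p_joint / p_indep)
    return scores


def top_adjective_patterns(tagged, k=10, adj_tag="ADJ"):
    """Count the POS sequences of the k best-scoring trigrams that contain an adjective."""
    scores = trigram_pmi(tagged)
    with_adj = [t for t in scores if any(tag == adj_tag for _, tag in t)]
    ranked = sorted(with_adj, key=scores.get, reverse=True)[:k]
    return Counter(" ".join(tag for _, tag in t) for t in ranked)


patterns = top_adjective_patterns(TAGGED, k=10)
for pattern, count in patterns.most_common():
    print(pattern, count)
```

The resulting pattern inventory could then be checked against a literature-derived list, with patterns absent from that list set aside for manual inspection.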
Similar resources
Building a Social Media Adapted PoS Tagger Using FlexTag -- A Case Study on Italian Tweets
English. We present a detailed description of our submission to the PoSTWITA shared-task for PoS tagging of Italian social media text. We train a model based on FlexTag using only the provided training data and external resources like word clusters and a PoS dictionary which are built from publicly available Italian corpora. We find that this minimal adaptation strategy, which already worked we...
TED-MWE: a bilingual parallel corpus with MWE annotation Towards a methodology for annotating MWEs in parallel multilingual corpora
English. The translation of multiword expressions (MWEs) by Machine Translation (MT) represents a big challenge, and although MT has considerably improved in recent years, MWE mistranslations still occur very frequently. There is a need to develop large data sets, mainly parallel corpora, annotated with MWEs, since they are useful both for SMT training purposes and MWE translation quality eval...
Parsing di Corpora di Apprendenti di Italiano: un Primo Studio su VALICO (Parsing Italian Learner Corpora: a Case Study on VALICO)
English. Modern learner corpora are now routinely PoS tagged, whereas syntactic parsing is much less frequent. This paper proposes a first attempt at parsing applied to a subcorpus of VALICO, in an effort to identify key elements to be further used to parse corpora of Italian as a foreign language in
Creation of Lexical Resources for a Characterisation of Multiword Expressions in Italian
The theoretical characterisation of multiword expressions (MWEs) is tightly connected to their actual occurrences in data and to their representation in lexical resources. We present three lexical resources for Italian MWEs, namely an electronic lexicon, a series of example corpora and a database of MWEs represented around morphosyntactic patterns. These resources are matched against, and creat...
Conceptual Structure of Automatically Extracted Multi-Word Terms from Domain Specific Corpora: a Case Study for Italian
This paper is based on our efforts on automatic multi-word term extraction and its conceptual structure for multiple languages. At present, we mainly focus on English and the major Romance languages such as French, Spanish, Portuguese, and Italian. This paper is a case study for the Italian language. We present how to automatically build the conceptual structure of automatically extracted multi-word t...